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!— '« Abstract 

. We compare numerically computed resonances of the human vocal 

| tract with formants that have been extracted from speech during vowel 

' pronunciation. The geometry of the vocal tract has been obtained by 

■ MRI from a male subject, and the corresponding speech has been 
\ recorded simultaneously. The resonances are computed by solving 

the Helmholtz partial differential equation with the Finite Element 
Method (FEM). 

. Despite a rudimentary exterior space acoustics model, i.e., the 

Dirichlet boundary condition at the mouth opening, the computed 
• ■ resonance structure differs from the measured formant structure by ~ 

■ 0.7 semitones for [i] and [u] having small mouth opening area, and by 
~ 3 semitones for vowels [a] and [ae:] that have a larger mouth open- 
ing. The contribution of the possibly open velar port has not been 
taken into consideration at all which adds discrepancy for [a] in the 
present data set. We conclude that by improving the exterior space 
model and properly treating the velar port opening, it is possible to 
computationally attain four lowest vowel formants with an error less 
than a semitone. The corresponding wave equation model on MRI- 
produced vocal tract geometries is expected to have a comparable 
accuracy. 

Keywords. Formant analysis, acoustic resonance computation, FEM, MRI. 
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1 Introduction 

The purpose of this paper is to evaluate the accuracy of vowel simulations 
based on the wave equation model (JTJ). We use 3D vocal tract (VT) geome- 
tries that have been obtained by Magnetic Resonance Imaging (MRI) from 
a native male speaker of Finnish while he pronounces prolonged vowels [a] , 
[i], [u], and [ce]. The evaluation is carried out by comparing the computed 
resonances of (pQ) with the measured formants, extracted from sound sam- 
ples, instead of comparing simulated and actual speech signals. In this work, 
the sound samples have been recorded simultaneously with the MRI, using the 
equipment and the arrangements detailed in [SI El |2TJ [27J . This is in con- 
trast to, e.g., our earlier work [T5] where only a single anatomic configuration 
(corresponding to Swedish [0] ) was taken from the data set of [13] . 
We use the same wave equation model for vowels as in [15], namely 



where Q C IR 3 is the interior of the VT whose boundary dQ = r\ U T 2 U T 3 
consists of the mouth opening r 1; the VT tissue walls r 2 , and an (imaginary) 
control surface T 3 placed right above the glottis. The parameters c = 350 
m/s and po = 1.225 kg/m 3 are the speed of sound in and the density of 
dry air at T = 305 K, respectively. The functions in ([1]) are as follows: 
u = u(r,t) is the incoming acoustic power (per unit area) at the glottis 
input, ^ = v ■ V$, and v is the exterior unit normal on dVl. In time domain 
simulations, we compute the velocity potential $(r, t) for a given glottal 
input function u{r,t) produced by a source such as described in [2], [3]. From 
the velocity potential, the sound pressure and (perturbation) velocity can be 
extracted as the partial derivatives p' = po^t and v = V$. The physics of 
model ([T]) is further explained in [T3] and the references therein. 

In the past, the VT acoustics has been modelled in many different ways. 
Electrical transmission lines have been used already in [lOj, and the classi- 
cal Kelly-Lochbaum model in [17] makes use of reflection/transmission coef- 
ficients of a variable diameter tube. The 3D wave equation for linear wave 
propagation as well as the related Helmholtz equation for acoustic resonances 
have been known for a long time; see, e.g. [IB]. The Kelly-Lochbaum model 
is closely related to Webster's equation in, e.g., [H] but the latter can be 
easily deduced from variable-impedance electrical transmission lines as well 
as from the wave equation in 3D tubular domains as shown in great gener- 
ality in [19J. More advanced models are the transmission line networks that 
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have been applied for speech in, e.g., [HI [TJl 122]; see also [I] for a purely 
mathematical treatment. 

At their best, all of these modelling paradigms are known to produce 
very good simulated speech even though they are based on radically sim- 
plified representations of the underlying anatomic geometry Q with the sole 
exception of the wave equation. In most applications, simplifications are even 
desirable as it may improve conceptual clarity and reduce the computational 
burden. There are, however, situations where the faithful representation of 
the VT geometry is required, e.g., when modelling the effects of anatomical 
abnormalities and maxillofacial surgery on speech [SJ [231 123 E21 |33j 134"] . 

2 Background and motivation 

In our earlier work [15] . the same numerical computations were carried out 
using a minimal data set: a single MRI-based anatomic geometry Q corre- 
sponding to the Swedish vowel [0] . The computed resonances were compared 
to the first four formants of all Swedish vowels (including [0]) that were ex- 
tracted from speech samples of the same test subject. The speech samples 
were not recorded simultaneously with the MRI because of technological re- 
strictions but the subject was in a similar supine position during both MRI 
and speech recording; see [T3] . 

We made the following observations in [T5] : 

1. The computed resonances R1...R4 corresponding the formants F1...F4 of 
[0] are systematically 3| semitones too high compared to the measured 
values; 

2. the formant ratios of the computed and measured data (i.e., Ri/Ri and 
Fi/Fi for i = 2,3,4) correspond to each other quite well; and 

3. if the systematic error in R1...R4 of [0] is compensated by linear scaling, 
then the scaled, computed data gets identified correctly as [0] in the 
measured formant table from the same subject. 

Two main potential sources were identified for the discrepancy: (i) the 
Dirichlet boundary condition on Ti in ([1]) results in too short acoustic length 
of the computational VT; and (ii) the minimal data set used in f73) / is in- 
sufficient to draw any conclusions on the error sources. The purpose of this 
work is to exclude the latter possibility (ii) by extending and improving the 
data set in an essential manner. We also aim at a deeper understanding of 
the sources of descrepancy to guide future model improvements and to un- 
derstand the quality of simulation that one can reasonably expect to attain. 
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We remark that the formant computation of [TH] was later validated by in- 
dependent FEM resonance computations that were based on the generalized 
Webster's model instead of (pQ); the computed resonances R1...R4 correspond- 
ing to F1...F4 were practically the same as reported in pQ Table 3.1 on p. 
31]. As shown in [19], the generalized Webster's model is a low-frequency 
approximation of the wave equation in a tubular domain. In the case of hu- 
man VT, the approximation remains accurate for formants F1...F3 and even 
for F4 at least in some vowel configurations where cross-mode resonances do 
not dominate; see [T5], Fig. 1]. 

We comment on the interesting parallel work [32] at the end of the article. 

3 Model and methods 

As explained in [151 P- 3], the resonances of Eq. (pP) can be solved by finding 
the complex frequencies A such that the Helmholtz problem 

A 2 $ A = c 2 A$ A on n, $ A = on r 1; 

= on T 2 , and A$ A + cf^ = on T 3 

is solvable for some nonzero eigenf unctions $ A (r). It is known that all such 
eigenvalues A form an infinite sequence {^j}jez\{o} with |Aj| — > 00 as \j\ — > 
00, ReXj < 0, and ImX-j = —ImXj. The lowest formants F±, F2, . . ., 
correspond to the numbers Rj = ImXj for j = 1,2,... in the order of 
increasing imaginary parts. 

As solving Eqs. (pQ) and analytically is possible only in a radically 
simplified geometry [31], we solved the problem numerically by the Finite 
Element Method (FEM). This is the approach used by [E], [21], [9], [29], 
[35] . and by many others. Eqs. (J2J) were solved in variational form as given 
in [13 Eq. (5)] using a custom implementation of FEM programmed in MAT- 
LAB. We used piecewise linear shape functions on tetrahedral meshes. The 
tetrahedral meshes were generated using TetGen [30] from a triangular sur- 
face mesh. As a result, we obtained the matrices A and B for a high-order 
eigenvalue problem Ay(X) = XBy(X) as explained in [151 Eq. (6)]. The low- 
est eigenvalues Xj, j = 1,2,3,4 were then computed using the eigs routine 
of MATLAB. It takes around 3 seconds on a workstation with an Intel Xeon 
X3450 processor to build the matrices and to solve the eigenvalue problem. 

The imaginary parts of the computed Xj are given in Table Q] together 
with the number of elements used in each VT geometry. The computed 
values are good approximations of the eigenvalues defined in Eqs. ([2]) when 
the number of elements is high enough. It was observed with the anatomic 
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geometry of [u] that using four times as many elements does not change the 
numerical result, and thus the resonances given in Tabled] can be regarded 
as accurate in this respect. 



Vowel 


Rx 


Rz 


#3 


i?4 


# of elem. 


[a] 


720 


1547 


2721 


4138 


47514 




246 


2135 


3592 


4667 


37335 


[u] 


324 


659 


2262 


3091 


50579 


[ce] 


562 


1612 


2519 


3602 


53087 



Table 1: Resonances (in Hz) of the Helmholtz problem (jSJ) by FEM, and the 
number of the elements in each geometry. 



4 Geometric data from MRI 

The raw MRI data was collected in pilot experiments in June 2010. A na- 
tive male Finnish speaker pronounced prolonged vowels in a supine position 
inside a MRI machine. A data set of 53 simultaneously recorded MRI data 
and sound samples was produced as reported in [5]. For this study, four 
samples corresponding to the vowels [a], [i], [u], and [ce] were chosen from 
this data set. The selection criteria were 1) a visual quality assessment of the 
spectrogram data, and 2) the requirement that the Fi-F 2 vowel space should 
be covered in a satisfactory manner. The spectrograms of these samples can 
be found in [27, p. 64, p. 66, p. 68, and p. 79]. The first three of these samples 
were pronounced at the fundamental frequency fo = 110 Hz and the last one 
at f = 137.5 Hz. 

A Siemens Magnetom Avanto 1.5T scanner was used in these experi- 
ments. A 12-element Head Matrix Coil was combined with a 4-element Neck 
Matrix Coil in order to cover the anatomy of interest. 3D VIBE (Volumetric 
Interpolated Breath-hold Examination) [28] was found to be the most suit- 
able MRI sequence for rapid 3D acquisition. It was originally developed for 
fast 3D imaging of the abdominal region where breath-hold during the scan 
is essential, as the naming of the sequence suggests. Sequence parameters 
were optimized in order to minimize the acquisition time, and we were able 
to carry out MRI with 1.8 mm isotropic voxels in just 7.6 s. 

The tissue/air interface from the MR data was extracted by combining 
sagittal DICOM sequence -images to form a 3D voxel image. A triangular 
surface mesh of the interface was then extracted using custom MATLAB 
code. The three boundary components Ti, and were identified manu- 
ally in the triangle mesh so that the different boundary conditions could be 
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Figure 1: Air/tissue interface of the test subject while pronouncing [y]. The 
nasal cavities have been excluded. 

applied in the right places. Since teeth are not visible in MRI (and hence, 
they are not part of the computational geometry of this work), some result- 
ing artefacts had to be corrected manually. The velar port was open in the 
geometries of [a] and [i] , and the resulting hole in the surface model was man- 
ually closed. A shaded representation of a typical surface mesh is presented 
in Fig. [TJ 

The geometric error in the tissue/air interface is a fraction of the voxel- 
based resolution of the original MRI data: interpolating in 2D sections by the 
gray-scale values of pixels results in about 1.8 bits of additional information 
compared to the MRI pixel size, corresponding to the geometric error of ~ 
0.5 mm with the current voxel size of 1.8 mm; see [5]. We conclude that the 
resonances in Table [T] do not contain essential errors due to inaccuracies of 
surface geometries. 

5 Sound recording and formant extraction 

The interior of a MRI machine is a challenging environment for speech record- 
ing. We used the recording arrangement discussed in [211 EZ] and the exper- 
imental arrangements described in [5j Section 2]; see also [6]. 

Let us briefly describe the recording arrangement. A two-channel sound 
collector samples the speech and noise signals in a dipole configuration. The 
sound collector is an acoustically passive, non-microphonic device which does 
not cause artifacts in the MR images. The sound signals are coupled to a 
RF-shielded microphone array by acoustic waveguides of length 3 m. There 
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are tuned acoustic impedance terminations at the both ends of the wave- 
guides to sufficiently reduce longitudinal resonances. The microphone signals 
are coupled to an amplifier that is situated outside the MRI room. This 
analogue electronics is used to optimally subtract the noise signal from the 
contaminated speech signal in real time, and the cleaned-up signal is fed 
back to test subject's earphones. The same audio signal is digitized by a 
24bit ADC, and the residual longitudinal resonances of the waveguides are 
compensated numerically in the post-processing stage. 



Vowel 


Fx 


F 2 


^3 


F 4 


[a] 


651 ± 7 


1024 ± 35 


2647 ± 117 


3679 =f 36 




247 ± 9 


2183 


3304 =p 46 


4407 =F 251 


[u] 


306 =F 37 


675 =f 39 


2173 ± 13 


3242 ± 139 


[oe] 


483 =f 35 


1249 ± 74 


1994 t 50 


3188 =f 17 



Table 2: Formants (in Hz) computed as means of those extracted from the 
beginnings and ends of the samples. The upper (lower) sign refers to the 
beginning (resp., the end) of the sample. 

For this work, the formants F1...F4 were first extracted separately from 
the beginnings and the ends of the sound samples where the acoustic MRI 
noise is absent. The extraction was done by LPC using MATLAB similarly 
to the approach explained in [27, Chapter 6]. However, the present values 
were obtained by applying LPC analysis to FFT power spectra of the signals 
with the algorithm detailed in [20J. The residual waveguide resonances were 
removed from the spectra before LPC analysis. The results were compared 
visually to the peaks of the smoothed spectra to detect crude errors. The 
final results in Table [2] are the averages of these two values, augmented with 
their half distances. The formant data for [i] is subject to following remarks: 
(i) the LPC found a peak at 855 ±133 Hz but this was removed from the data 
set as an outlier; (ii) F 2 could not be extracted from the beginning sample 
by the LPC even though it can be found easily in spectral curves by visual 
inspection; (iii) the LPC finds very strong double peaks about 500 Hz apart 
at F 3 , and the F 3 - values given in Table [2] are defined as their means. 

The data of Table [2] can be found in [26] except [oe] (which is from a 
different series where /o = 137.5 Hz) and F 3 , F 4 of [i] (which required a 
higher order LPC run). These results without numerical compensation of 
the residual longitudinal resonances of the waveguides can be found in [271 
Tables 6.2 and 6.3 on p. 49-50]. We remark that a typical mean deviation in 
vowel formant frequencies (when measured in ideal conditions in an anechoic 
chamber rather than inside a MRI machine) is of the order of 0.5 semitones; 
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Vowel 


D x 


D 2 


D 3 


D, 


mean discr. 


[a] 


1.7 


7.1 


0.5 


2.0 


2.8 


[i] 


-0.1 


-0.4 


1.4 


1.0 


0.7 




1.0 


-0.4 


0.7 


-0.8 


0.7 


[oe] 


2.6 


4.4 


4.0 


2.1 


3.3 



Table 3: Discrepancy (in semitones) between computed resonances and mean 
formant frequencies from Tabled Positive number implies that the computed 
resonance is higher than the measured formant. 

see [6]. 

6 Results and conclusions 

Just as in [15J, our new data indicates that the computed resonances Ri 
from 02]) tend to be higher than the measured formants Fj. The discrepancy 
given in Table E] is in semitone scale to make the comparison easy with the 
"3| semitone rule" that was discovered in [T5]. Recall that the difference of 
frequencies F and R is D = 12 \n(R/F)/ ln(2) semitones. 

The main sources of discrepancy in these computations and experiments 
follows: (i) less than perfect performance of the test subject in the 
MRI machine, (ii) sporadic problems in formant extraction by the LPC, and 
(iii) physically unrealistic boundary conditions in Eqs. fll])-© especially at 
the mouth opening. 

The mean discrepancy in Table [3] is at its largest for vowels [a] and [oe] 
where the computed resonances are consistently higher than the measured 
formants. Also, the mouth opening is largest for these vowels in our data 
set, and the Dirichlet boundary condition on r x in (j2J) is expected to be the 
most significant error source. All this is in good agreement with the results 
and the conclusions of [15] . 

The particularly significant error in F2 of [a] can be explained by the fact 
that the velar port of the subject was unexpectedly open in the MR image, 
and the anti-node of the standing pressure wave (corresponding to F 2 ) would 
be at the velar port if it was closed. The nasality of the pronunciation 
is clearly heard from the speech sample. In the computational geometry, 
however, the hole was closed manually which leads to the perfectly reflecting 
Neumann boundary condition for the velar opening. It is physically more 
realistic to use a similar boundary condition for the velar opening as on T 3 
in ©. 

It is worth noting that the computational model performs strikingly well 
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for [u]: The discrepancy is of the same order as the fluctuations in formant 
values in sustained vowel production [5J. The velar port is closed in this MRI 
geometry. Also, the mouth opening is very small which results in relatively 
smaller error due to the Dirichlet boundary condition. 

We conclude that the error profile in Table [3] supports earlier observations 
in [T5], and it can be qualitatively explained by considering the underlying 
physics. A more sophisticated exterior space model (compared to the Dirich- 
let boundary condition) is likely to remove most of the formant discrepancy 
in vowels where the mouth opening is large. Complications related to the 
open velar port should be treated by taking into account the nasal tract 
resonating structures. This can be done by including them in the computa- 
tional geometry or by setting an improved boundary condition at the velar 
port opening. 

The results of [32] support the view that computed and measured res- 
onances of a plastic model VT do not differ significantly from each other. 
Rather than carrying out speech recordings in MRI, the authors produce 3D 
physical printouts from MRI geometries by fast prototyping techniques for 
japanese vowels [a], [i], [u], [e], [o]. Separately imaged teeth geometries were 
manually aligned with the soft tissue geometries. The formant structure is 
measured from the plastic models in ideal conditions, and the same con- 
figuration is used for transfer function simulations by the Finite-Difference 
Time-Domain (FDTD) method. 

It is observed that the formant frequencies of the computed and measured 
transfer functions in [32] differ from each other by less than 3.2% (i.e., 0.55 
semitones) which is comparable to our results on vowels [i] and [u]. Such an 
indirect experimental arrangement excludes most of the sources of discrep- 
ancy considered in our work, and the results of [32] can therefore be regarded 
as a limit of what is reasonable to expect in a computational modelling effort 
of a natural vowel utterance. The effect of anatomic details such as piriform 
fossae, epiglottic valleculae and inter-dental spaces were computed, and the 
contribution of the latter two was found to be almost negligible in [32] . 
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